July 26, 2016
Neural random forests
Deep neural decision forests
A decision tree is a logical model represented as a tree that shows how the value of a target variable can be predicted by using the values of a set of predictor (input) variables.
A decision tree recursively partitions the data and sample space to construct the predictive model.
Its name derives from the practice of displaying the partitions as a decision tree, from which the roles of the predictor variables may be inferred.
A classification or regression tree is a prediction model that can be represented as a decision tree.
A tree-structured classifier is a decision tree for predicting a class variable from one or more predictor variables.
A regression tree model is a nonparametric estimate of a regression function constructed by recursively partitioning a data set with the values of its predictor variables.
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields.
Morgan and Sonquist (1964, JASA)
An empirical tree represents a segmentation of the data that is created by applying a series of split rules. Each split rule assigns an observation to a segment based on the values of inputs.
One rule is applied after another, resulting in a hierarchy of segments within segments. The hierarchy is called a tree, and each segment is called a node.
The original segment contains the entire data set and is called the root node of the tree. A node with all its successors forms a branch of the node that created it.
The nodes that have child nodes are called interior (or internal, intermediate) nodes. The final nodes that do not have child nodes are called leaves (or terminal nodes). For each leaf, a decision is made and applied to all observations in the leaf.
The type of decision depends on the context. In predictive modeling, the decision is simply the predicted value.
Split rule: determined by examining all possible splits (exhaustive search) or performing statistical tests
Good split rule: makes the data homogeneous within each subnode and heterogeneous between subnodes.
Loh, W.-Y. (2008). Regression by parts: Fitting visually interpretable models with GUIDE. In Handbook of Data Visualization, C. Chen, W. Härdle, and A. Unwin, Eds. Springer, pp. 447-469.
Loh, W.-Y. (2008). Classification and regression tree methods. In Encyclopedia of Statistics in Quality and Reliability, F. Ruggeri, R. Kenett, and F. W. Faltin, Eds. Wiley, Chichester, UK, pp. 315-323.
Loh, W.-Y. (2010). Tree-structured classifiers. Wiley Interdisciplinary Reviews: Computational Statistics, 2, 364-369.
Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 14-23.
Loh, W.-Y. (2014). Fifty years of classification and regression trees (with discussion). International Statistical Review.
Merkle, E.C. and Shaffer, V.A. (2011). Binary recursive partitioning: Background, methods, and application to psychology, British Journal of Mathematical and Statistical Psychology, 64, 161–181.
Morgan, J. N. and Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. J. Amer. Statist. Assoc. 58 415–434.
Strobl, C., Malley, J. and Tutz, G. (2009). An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random forests. Psychological Methods, 14(4), 323–348.
Piecewise constant least squares models: AID (Morgan and Sonquist, 1964), CART, RPART, GUIDE (Loh, 2002)
Piecewise linear least squares, polynomial regression, quantile regression, and longitudinal and multi-response data: Loh (2014), with discussion
Subgroup identification of differential treatment effects: Loh et al. (2014), Hothorn et al. (2014)
Others: M5 (Quinlan, 1992), MOB (Zeileis et al., 2008), MELT (Eo and Cho, 2014), KAPS (Eo et al., 2015)
Missing values, selection bias, accuracy, speed, and tree complexity
Diagnostic tool: Simonoff (2013)
Fit a constant, the node mean \(\bar{Y}_t\) at a node \(t\)
Use sum of squared deviations \(\sum_{i \in t} (Y_i - \bar{Y}_t)^2\) as a node impurity \(i(t)\)
Choose the split set that maximizes the decrease in node impurity
Use \(\bar{Y}_t\) in node \(t\) as predicted value.
Prune tree using test sample or cross-validation
Use surrogate splits to deal with missing values
A naive approach is to evaluate the reduction of impurity, \(\Delta i(\cdot)\), over all possible splits, and select the split set with the greatest reduction of impurity,
\[ \Delta i(s,t) = i(t) - p_L i(t_L) - p_R i(t_R), \]
\[ A = (-\infty, c], \quad \mbox{if $X$ is ordinal} \]
\[ A \subset \{\mbox{categories of } X\}, \quad \mbox{if $X$ is categorical} \]
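The exhaustive search over splits can be sketched as follows (a minimal illustration in Python rather than R; the function names are ours and the categorical case is omitted). For an ordinal predictor \(X\), every cutpoint \(c\) defines a split \(A = (-\infty, c]\), and the split with the greatest impurity decrease \(\Delta i(s,t)\) is chosen.

```python
def impurity(y):
    """Node impurity i(t): sum of squared deviations from the node mean."""
    if not y:
        return 0.0
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y)

def best_split(x, y):
    """Return (cutpoint, impurity decrease) of the best split A = (-inf, c]."""
    parent = impurity(y)
    best_c, best_gain = None, 0.0
    for c in sorted(set(x))[:-1]:          # candidate cutpoints
        left  = [yi for xi, yi in zip(x, y) if xi <= c]
        right = [yi for xi, yi in zip(x, y) if xi > c]
        # Delta_i(s,t) = i(t) - p_L i(t_L) - p_R i(t_R); with the summed-SSE
        # impurity the child proportions p_L, p_R are implicit in the sums
        gain = parent - impurity(left) - impurity(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain
```

For example, with two well-separated clusters of responses, `best_split` recovers the boundary between them as the cutpoint.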
Enable tree construction when there are missing values in the learning sample
Enable classification of new cases with missing values
Rank variables by their order of importance (not available in rpart)
Detect masking of variables
For more details, see the talk slides.
See the JCGS paper
At each node, a case goes to the left child node if stated condition is satisfied. Sample sizes are beside terminal nodes.
95% bootstrap intervals for relative risk of treatment vs. placebo below nodes.
Identifying groups of patients for whom the treatment has a different effect than for others. The effect may be:
– stronger than,
– weaker than, or
– contrary to
the average treatment effect.
Suitable models promise better prediction of treatment effect and thus individualised treatments.
Regression trees are natural for subgroup identification
Examine residual patterns for each treatment level
Test for treatment interaction
Perturb the population instead of the data
The idea discussed here is a simple one that has (perhaps) been underutilized over the years: since the errors are supposed to be unstructured if the model assumptions hold, a method that searches the residuals for unspecified structure can be used to identify model violations. A natural method for this is a regression tree.
Miller (1996) proposed using a CART regression tree for this purpose in the context of identifying unmodeled nonlinearity in linear least squares regression, terming it a diagnostic tree
Su, Tsai, and Wang (2009) altered this idea slightly by simultaneously including both linear and tree-based terms in one model, terming it an augmented tree, assessing whether the tree-based terms are deemed necessary in the joint model. They also note that building a diagnostic tree using squared residuals as a response can be used to test for heteroscedasticity.
The diagnostic trees are not meant to replace examination of residuals or more focused (and powerful) tests of specific model violations; rather, they are an omnibus tool to add to the data analyst’s toolkit to try to help identify unspecified mixed effects model violations.
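As a toy illustration of the diagnostic-tree idea (in Python rather than R, with one predictor and a single split; the helper names are ours, and a real analysis would grow a full CART/rpart tree on the residuals):

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))
    return my - beta * mx, beta

def sse(v):
    """Sum of squared deviations from the mean."""
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v)

def diagnostic_split(x, y):
    """Best single split of the OLS residuals (x assumed sorted);
    returns (impurity reduction, split index)."""
    a, b = simple_ols(x, y)
    r = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return max((sse(r) - sse(r[:k]) - sse(r[k:]), k)
               for k in range(1, len(r)))
```

On curved data the best split yields a large impurity reduction, flagging unmodeled nonlinearity; on exactly linear data the reduction is essentially zero.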
RE-EM Tree algorithm with diagnostic tools for longitudinal and multi-response data
A final tree that splits from the root node rejects the null model.
Even though the growing/pruning rules for the tree are not designed to directly control Type I error, it turns out that they do at a roughly .05 level, resulting in a generally conservative test.
For more details, see the talk slides.
For more details, see Wang's web page.
See the partykit package for building tree-structured models
tree::tree, rpart::rpart, mvpart::mvpart, party::mob, psychotree::psychotree, betareg::betatree, RWeka::M5, evtree::evtree, REEMtree::REEMtree, vcrpart::vcrpart, kaps::kaps, melt::melt
Figure 1 of Boulesteix et al. (2012)
There are three parameters for RF algorithms:
By default, in the regression mode of the randomForest package, the parameters are mtry = max(floor(p / 3), 1), a_n = n, and nodesize = 5.
Algorithm 1 of Biau and Scornet (2016)
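That recipe can be sketched as follows (in Python rather than R; illustrative names, a naive exhaustive search, and none of the refinements of a real implementation): each tree is grown on a bootstrap sample, and at every node only `mtry` randomly chosen coordinates are searched for the best split, with `nodesize` controlling the minimum leaf size.

```python
import random

def _sse(v):
    """Sum of squared deviations from the mean (node impurity)."""
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v)

def grow_tree(X, y, mtry, nodesize=5):
    """Recursively grow one tree; leaves store node means."""
    if len(y) <= nodesize or len(set(y)) == 1:
        return sum(y) / len(y)                        # leaf: node mean
    parent, best = _sse(y), None                      # best = (gain, j, c)
    for j in random.sample(range(len(X[0])), mtry):   # random coordinates
        for c in {row[j] for row in X}:               # candidate cutpoints
            left  = [yi for row, yi in zip(X, y) if row[j] <= c]
            right = [yi for row, yi in zip(X, y) if row[j] > c]
            if not left or not right:
                continue
            gain = parent - _sse(left) - _sse(right)
            if best is None or gain > best[0]:
                best = (gain, j, c)
    if best is None:
        return sum(y) / len(y)
    _, j, c = best
    L = [(row, yi) for row, yi in zip(X, y) if row[j] <= c]
    R = [(row, yi) for row, yi in zip(X, y) if row[j] > c]
    return (j, c,
            grow_tree([r for r, _ in L], [yi for _, yi in L], mtry, nodesize),
            grow_tree([r for r, _ in R], [yi for _, yi in R], mtry, nodesize))

def predict(tree, x):
    """Follow splits (j, c, left, right) down to a leaf mean."""
    while isinstance(tree, tuple):
        j, c, lo, hi = tree
        tree = lo if x[j] <= c else hi
    return tree

def forest(X, y, M=25, mtry=1):
    """Aggregate M trees, each grown on a bootstrap sample (a_n = n)."""
    n, trees = len(y), []
    for _ in range(M):
        idx = [random.randrange(n) for _ in range(n)]
        trees.append(grow_tree([X[i] for i in idx],
                               [y[i] for i in idx], mtry))
    return lambda x: sum(predict(t, x) for t in trees) / M
```

On a toy step-function dataset, the averaged predictions recover the low and high regions of the response.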
RF can be used to rank the importance of variables via two importance measures:
For more detail, see Goldstein et al. (2014).
Set \({\bf X} = (X^{(1)}, \ldots, X^{(p)}).\)
For a forest resulting from the aggregation of \(M\) trees, the MDI of the variable \(X^{(j)}\) is defined by
\[
\widehat{MDI}(X^{(j)}) = \frac{1}{M} \sum_{l = 1}^{M} \sum_{t \in \mathcal{T}_l} p_{n,t} L_{reg, n} (j_{n,t}, z_{n,t}),
\]
where the inner sum runs over the nodes \(t\) of tree \(\mathcal{T}_l\) that split on \(X^{(j)}\), \(p_{n,t}\) is the fraction of observations falling in node \(t\), and \(L_{reg, n}(j_{n,t}, z_{n,t})\) is the empirical decrease in impurity achieved by splitting node \(t\) on variable \(j_{n,t}\) at location \(z_{n,t}\).
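Schematically, assuming each fitted tree has been summarised by records `(j, p_t, dL)` for its split nodes — a made-up bookkeeping format for this sketch, not any package's actual data structure — the MDI formula amounts to:

```python
def mdi(forest_splits, p):
    """Mean decrease of impurity per variable.

    forest_splits: one list per tree of (j, p_t, dL) records, where j is the
    splitting variable, p_t the fraction of observations in the node, and dL
    the empirical impurity decrease achieved there.
    """
    M = len(forest_splits)
    scores = [0.0] * p
    for tree in forest_splits:          # sum over the M trees ...
        for j, p_t, dL in tree:         # ... and over each tree's split nodes
            scores[j] += p_t * dL
    return [s / M for s in scores]      # average over trees
```

For two trees where variable 0 drives the large splits, `mdi` ranks it well above variable 1.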
For more details, see the useR! 2008 talk slides.
For more details, see the JSCS paper.
Online learning is a method in which data becomes available in sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques, which generate the best predictor by learning on the entire training dataset at once.
One of the strengths of RF is that they can handle missing data.
RF can naturally be adapted to fit the unbalanced data framework by down-sampling the majority class and growing each tree on a more balanced dataset (Chen et al. 2004; Kuhn and Johnson, 2013).
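A minimal sketch of that down-sampling step (the function name is illustrative; in R this can be done via randomForest's sampsize and strata arguments): before growing each tree, each class is sampled down to the minority-class size so the tree sees a balanced bootstrap.

```python
import random

def balanced_sample(X, y, rng=random):
    """Draw a balanced bootstrap: m cases per class, m = minority size."""
    classes = set(y)
    m = min(sum(1 for v in y if v == c) for c in classes)   # minority size
    sample = []
    for c in classes:
        members = [(xi, yi) for xi, yi in zip(X, y) if yi == c]
        sample += [rng.choice(members) for _ in range(m)]   # m per class
    rng.shuffle(sample)
    return [s[0] for s in sample], [s[1] for s in sample]
```

With 8 cases of class 0 and 2 of class 1, each tree would be grown on 2 cases per class.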
Fink et al. (2010) use RF for which each tree is trained on, and allowed to predict on, a particular region in space and time.

One-class classification (unary classification) tries to identify objects of a specific class amongst all objects, by learning from a training set containing only the objects of that class.
| Tree-structured Models | Random Forests |
|---|---|
| tree::tree | |
| rpart::rpart | randomForest::randomForest |
| mvpart::mvpart | randomForestSRC |
| party::mob | Rborist |
| psychotree::psychotree | ranger::ranger |
| betareg::betatree | party::cforest |
| RWeka::M5 | quantregForest |
| evtree::evtree | LogicForest |
| REEMtree::REEMtree | |
| vcrpart::vcrpart | |
| melt::melt | |
For more details, see the CRAN Task View
- RF can be viewed as a weighted neighborhood scheme.
+ \(\hat{y} = \sum_{i=1}^{n} W(x_i, x') y_i\).
+ where \(W(x_i, x')\) is the non-negative weight of the \(i\)th training point relative to the new point \(x'\).
+ In \(k\)-NN, the weight \(W(x_i, x') = \frac{1}{k}\) if \(x_i\) is one of the \(k\) points closest to \(x'\), and zero otherwise.
+ In RF, \(W(x_i, x') = \frac{1}{k'}\) if \(x_i\) is one of the \(k'\) points in the same leaf as \(x'\), and zero otherwise.
+ Since a forest averages the predictions of a set of \(m\) trees with individual weight functions \(W_j\), its prediction is \(\hat{y} = \sum_{i=1}^{n} \left( \frac{1}{m}\sum_{j=1}^{m} W_j(x_i, x') \right) y_i\).
This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of x' in this interpretation are the points \(x_{i}\) which fall in the same leaf as x' in at least one tree of the forest. In this way, the neighborhood of x' depends in a complex way on the structure of the trees, and thus on the structure of the training set.
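This identity can be checked numerically with two hand-made "trees" (toy partitions, not fitted models): the prediction from the averaged weights coincides with the usual average of per-tree leaf means.

```python
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [0.0, 1.0, 4.0, 9.0]

# two toy "trees", each reduced to its leaf-assignment function
tree1 = lambda v: 0 if v < 1.5 else 1
tree2 = lambda v: 0 if v < 2.5 else 1

def tree_weights(tree, x_new):
    """W(x_i, x') = 1/k' for the k' training points sharing x_new's leaf."""
    leaves = [tree(v) for v in x_train]
    k = leaves.count(tree(x_new))
    return [1.0 / k if leaf == tree(x_new) else 0.0 for leaf in leaves]

x_new = 2.0
# forest weights = average of the individual trees' weights
w = [(a + b) / 2 for a, b in zip(tree_weights(tree1, x_new),
                                 tree_weights(tree2, x_new))]
forest_pred = sum(wi * yi for wi, yi in zip(w, y_train))

# the same number obtained the usual way: average each tree's leaf mean
def leaf_mean(tree):
    ys = [yi for vi, yi in zip(x_train, y_train) if tree(vi) == tree(x_new)]
    return sum(ys) / len(ys)

usual_pred = (leaf_mean(tree1) + leaf_mean(tree2)) / 2
```

Here `forest_pred` and `usual_pred` agree, and the neighbors of \(x'\) are exactly the training points that share a leaf with it in at least one tree.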
For more detail, see Lin and Jeon (2006).
See the talk slides